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Abstract 

The  Defense  Advanced  Research  Projects  Agency  (DARPA)  High  Productivity  Computing  Systems 
(HPCS)  HPCchallenge  Benchmarks  examine  the  performance  of  High  Performance  Computing  (HPC) 
architectures  using  kernels  with  more  challenging  memory  access  patterns  than  just  the  High  Performance 
LINPACK  (HPL)  benchmark  used  in  the  Top500  list.  The  HPCchallenge  Benchmarks  build  on  the  HPL 
framework and augment the Top500 list by providing benchmarks that bound the performance of many real
applications as a function of memory access locality characteristics. The real utility of the HPCchallenge
Benchmarks is that architectures can be described with a wider range of metrics than just Flop/s from HPL.
Even  a  small  percentage  of  random  memory  accesses  in  real  applications  can  significantly  affect  the  overall 
performance of that application on architectures not designed to minimize or hide memory latency. The
HPCchallenge Benchmarks include a new metric, Giga UPdates per Second (GUPS), and a new benchmark,
RandomAccess, to measure the ability of an architecture to access memory randomly, i.e., with no
locality. When looking only at HPL performance and the Top500 list, inexpensive build-your-own
clusters  appear  to  be  much  more  cost  effective  than  more  sophisticated  HPC  architectures.  HPCchallenge 
Benchmarks  provide  users  with  additional  information  to  justify  policy  and  purchasing  decisions.  We  will 
compare the measured HPCchallenge Benchmark performance on various HPC architectures, from Cray
X1s to Beowulf clusters, in the presentation and paper. Additional information on the HPCchallenge
Benchmarks  can  be  found  at  http://icl.cs.utk.edu/hpcc/ 

Introduction 

At  SC2003  in  Phoenix  (15-21  November  2003),  Jack  Dongarra  (ICL/UT)  announced  the  release  of  a  new 
benchmark suite, the HPCchallenge Benchmarks, that examines the performance of HPC architectures
using kernels with more challenging memory access patterns than the High Performance Linpack (HPL) benchmark used
in the Top500 list. The HPCchallenge Benchmarks are being designed to complement the Top500 list and
provide benchmarks that bound the performance of many real applications as a function of memory access
characteristics such as spatial and temporal locality. Development of the HPCchallenge Benchmarks is
being  funded  by  the  Defense  Advanced  Research  Projects  Agency  (DARPA)  High  Productivity  Computing 
Systems  (HPCS)  Program. 
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Flop/s 

The  Flop/s  metric  from  HPL  has  been  the  de  facto  standard  for  comparing  High  Performance  Computers 
for many years. HPL works well on all architectures, even cache-based, distributed memory
multiprocessors, but the measured performance may not be representative of a wide range of real user
applications such as adaptive multi-physics simulations used in weapons and vehicle design, weather and
climate models, and defense applications. HPL is more compute friendly than these applications because it
has more extensive memory reuse in the Level 3 BLAS-based calculations.

Memory  Performance 

There  is  a  need  for  benchmarks  that  test  memory  performance.  When  looking  only  at  HPL  performance 
and the Top500 list, inexpensive build-your-own clusters appear to be much more cost effective than more
sophisticated HPC architectures. HPL has high spatial and temporal locality, characteristics shared by
few real user applications. The HPCchallenge Benchmarks provide users with additional information to justify
policy and purchasing decisions.

Not only does the Japanese Earth Simulator outperform the top American systems on the HPL benchmark
(Tflop/s), but the differences in bandwidth performance on John McCalpin's STREAM TRIAD benchmark
(Level 1 BLAS) show an even greater performance disparity. The Earth Simulator outperforms the ASCI Q
by a factor of 4.64 on HPL. Meanwhile, the higher bandwidth memory and interconnect systems of the
Earth Simulator are clearly evident as it outperforms ASCI Q by a factor of 36.25 on STREAM TRIAD. In
the presentation and paper, we will compare the measured HPCchallenge Benchmark performance on
various HPC architectures, from Cray X1s to Beowulf clusters, using the updated results at
http://icl.cs.utk.edu/hpcc/hpcc_results.cgi 

Even  a  small  percentage  of  random  memory  accesses  in  real  applications  can  significantly  affect  the  overall 
performance  of  that  application  on  architectures  not  designed  to  minimize  or  hide  memory  latency. 
Memory  latency  has  not  kept  up  with  Moore’s  Law.  Moore’s  Law  hypothesizes  a  60%  compound  growth 
rate  per  year  for  microprocessor  “performance”,  while  memory  latency  has  been  improving  at  a  compound 
rate of only 7% per year. The memory-processor performance gap has been growing at a rate of about 50%
per year since 1980. The HPCchallenge Benchmarks include a new metric, Giga UPdates per Second (GUPS),
and a new benchmark, RandomAccess, to measure the ability of an architecture to access memory
randomly, i.e., with no locality.
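
The growth rates quoted above can be combined into a yearly gap factor. The following is a minimal sketch in Python, assuming only the 60% and 7% compound rates from the text; the function names are hypothetical and chosen for illustration. Dividing the two compound factors gives a gap that widens by roughly 50% per year.

```python
def annual_gap_growth(cpu_rate, dram_rate):
    """Yearly growth factor of the processor/memory performance ratio."""
    return (1.0 + cpu_rate) / (1.0 + dram_rate)


def gap_after(years, cpu_rate=0.60, dram_rate=0.07):
    """Relative size of the gap after `years`, normalized to 1.0 at year zero."""
    return annual_gap_growth(cpu_rate, dram_rate) ** years


if __name__ == "__main__":
    growth = annual_gap_growth(0.60, 0.07)
    print(f"gap grows by about {100 * (growth - 1):.1f}% per year")
    print(f"gap after 20 years: about {gap_after(20):.0f}x")
```

Because the gap compounds, even a modest yearly factor becomes a difference of several orders of magnitude over two decades.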

GUPS is calculated by identifying the number of memory locations that can be randomly updated
in one second, divided by 1 billion (1e9). The term "randomly" means that there is little
relationship between one address to be updated and the next, except that they occur in the space of
1/2 the total system memory. An update is a read-modify-write operation on a table of 64-bit words.
An address is generated, the value at that address is read from memory and modified by an integer
operation (add, and, or, xor) with a literal value, and the new value is written back to memory.
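
The update loop and the GUPS calculation described above can be sketched as follows. This is an illustrative Python sketch, not the benchmark's reference code: the shift-register generator, its feedback constant `POLY`, the seed, the table size, and the choice of 4 updates per table entry are all assumptions, and the xor operand is the pseudo-random value itself rather than a separate literal. Interpreted Python is far slower than a native implementation, so the number it reports only illustrates the arithmetic.

```python
import time

POLY = 0x7                 # feedback constant for the shift register (assumption)
MASK64 = (1 << 64) - 1     # keep values within 64 bits


def next_random(ran):
    """Advance a 64-bit shift-register pseudo-random sequence by one step."""
    feedback = POLY if ran >> 63 else 0
    return ((ran << 1) & MASK64) ^ feedback


def random_access(table, n_updates, seed=0x0123456789ABCDEF):
    """Apply n_updates read-modify-write (xor) updates at pseudo-random slots."""
    index_mask = len(table) - 1          # table length must be a power of two
    ran = seed
    for _ in range(n_updates):
        ran = next_random(ran)
        table[ran & index_mask] ^= ran   # read, modify with xor, write back


def measure_gups(log2_size=16):
    """Time the update loop and convert updates per second to GUPS."""
    table = list(range(1 << log2_size))
    n_updates = 4 * len(table)
    start = time.perf_counter()
    random_access(table, n_updates)
    elapsed = max(time.perf_counter() - start, 1e-12)
    return n_updates / elapsed / 1e9


if __name__ == "__main__":
    print(f"{measure_gups():.6f} GUPS")
```

Since xor is its own inverse, running the same update stream twice restores the original table, which gives a simple way to check that no update was lost.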
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High  Productivity  Computing  System 


>  Create  a  new  generation  of  economically  viable  computing  systems  and  a 
procurement  methodology  for  the  security/industrial  community  (2007  -  2010) 


Impact: 

•  Performance (time-to-solution): speed up critical national
security applications by a factor of 10X to 40X

•  Programmability  (idea-to-first-solution):  reduce  cost  and 
time  of  developing  application  solutions 

•  Portability  (transparency):  insulate  research  and 
operational  application  software  from  system 

•  Robustness  (reliability):  apply  all  known  techniques  to 
protect  against  outside  attacks,  hardware  faults,  & 
programming  errors 


Applications:

•  Intelligence/surveillance, reconnaissance, cryptanalysis, weapons analysis, airborne contaminant
modeling and biotechnology

[Figure: HPCS Program Focus Areas. Analysis & Assessment; Performance Characterization & Prediction; Programming Models; Hardware Technology; System Architecture; Software Technology; Industry R&D.]


Fill  the  Critical  Technology  and  Capability  Gap 
Today (late 80's HPC technology) ... to ... Future (Quantum/Bio Computing)


High  Productivity  Computing  Systems 

-Program  Overview- 


>  Create  a  new  generation  of  economically  viable  computing  systems  and  a 
procurement  methodology  for  the  security/industrial  community  (2007  -  2010) 


[Figure: HPCS program timeline. Phase 1: Concept Study. Phase 2 (2003-2005): Advanced Design & Prototypes, with a Half-Way Point Phase 2 Technology Assessment Review. Phase 3 (2006-2010): Full Scale Development, leading to Petascale/s Systems. Products along the way: New Evaluation Framework, Test Evaluation Framework, Validated Procurement Evaluation Methodology.]
HPCS Program Goals†


HPCS  overall  productivity  goals: 

-  Execution  (sustained  performance) 

■  1  Petaflop/sec  (scalable  to  greater  than  4  Petaflop/sec) 

■  Reference:  Production  workflow 


-  Development 

■  10X over today's systems

■  Reference:  Lone  researcher  and  Enterprise  workflows 


Productivity  Framework 

-  Baselined for today's systems

-  Successfully used to evaluate the vendors' emerging
productivity techniques

-  Provide a solid reference for evaluation of vendors' proposed
Phase III designs.


Subsystem  Performance  Indicators 

1)  2+ PF/s  LINPACK 

2)  6.5  PB/sec  data  STREAM  bandwidth 

3)  3.2  PB/sec  bisection  bandwidth 

4)  64,000  GUPS 
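
The four indicators can be related to one another as balance ratios per LINPACK flop. The sketch below is a hedged illustration in Python: it treats "2+ PF/s" as exactly 2 PF/s, so every ratio is an upper bound, and the variable names are hypothetical.

```python
PETA = 1e15
GIGA = 1e9

# Subsystem performance indicators from the list above.
linpack_flops = 2 * PETA       # "2+ PF/s" taken as its lower bound (assumption)
stream_bytes = 6.5 * PETA      # bytes per second of STREAM bandwidth
bisection_bytes = 3.2 * PETA   # bytes per second of bisection bandwidth
random_updates = 64_000 * GIGA # 64,000 GUPS expressed as updates per second

# Balance ratios: how much data movement accompanies each flop.
stream_bytes_per_flop = stream_bytes / linpack_flops
bisection_bytes_per_flop = bisection_bytes / linpack_flops
updates_per_flop = random_updates / linpack_flops

if __name__ == "__main__":
    print(f"STREAM:       {stream_bytes_per_flop:.2f} Bytes/FLOP")
    print(f"Bisection:    {bisection_bytes_per_flop:.2f} Bytes/FLOP")
    print(f"RandomAccess: {updates_per_flop:.3f} updates/FLOP")
```

Read this way, the goals describe a machine with a few bytes of memory bandwidth per flop, which is the quantity the Bytes/FLOP discussion in the later slides is concerned with.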


†Bob Graybill (DARPA/IPTO)
(Emphasis  added) 
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Processor-Memory  Performance  Gap 


[Figure: processor-memory performance gap, plotted as performance versus year. CPU (µProc) performance grows 60%/yr.; DRAM performance grows 7%/yr.; the processor-memory performance gap grows 50%/year.]


•  Alpha 21264 full cache miss / instructions executed:

180 ns / 1.7 ns = 108 clks x 4, or 432 instructions

•  Caches  in  Pentium  Pro:  64%  area,  88%  transistors 
*Taken  from  Patterson-Keeton  Talk  to  SigMod 
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Processing  vs.  Memory  Access 


*  Doesn’t  cache  solve  this  problem? 

-  It  depends.  With  small  amounts  of  contiguous  data,  usually. 
With  large  amounts  of  non-contiguous  data,  usually  not 

-  In  most  computers  the  programmer  has  no  control  over 
cache 

-  Often  “a  few”  Bytes/FLOP  is  considered  OK 

*  However, consider operations on the transpose of a matrix
(e.g., for adjoint problems)

-  Xa = b    X^T a = b

-  If X is big enough, 100% cache misses are guaranteed, and
we need at least 8 Bytes/FLOP (assuming a and b can be held
in cache)

*  Latency  and  limited  bandwidth  of  processor-memory  and 
node-node  communications  are  major  limiters  of 
performance  for  scientific  computation 
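
The effect of transposed access can be illustrated by the distance in memory between consecutive elements touched in the inner loop. The sketch below is a simplified Python model under stated assumptions: row-major storage, 8-byte elements, and a 64-byte cache line. The function names are hypothetical, and whether a touched line actually misses depends on the cache size relative to the matrix.

```python
WORD_BYTES = 8          # 64-bit matrix elements (assumption)
CACHE_LINE_BYTES = 64   # typical cache line size (assumption)


def stride_bytes(n_cols, transposed):
    """Byte distance between consecutive inner-loop elements of a row-major matrix."""
    return WORD_BYTES * (n_cols if transposed else 1)


def new_line_fraction(n_cols, transposed):
    """Fraction of accesses that start a new cache line (1.0 means every access)."""
    return min(1.0, stride_bytes(n_cols, transposed) / CACHE_LINE_BYTES)


if __name__ == "__main__":
    n = 10_000
    print("X   :", stride_bytes(n, False), "bytes,", new_line_fraction(n, False))
    print("X^T :", stride_bytes(n, True), "bytes,", new_line_fraction(n, True))
```

With contiguous access only one in eight references starts a new line, while the transposed traversal starts a new line on every reference, so a sufficiently large X leaves nothing for the cache to reuse.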

MITRE  ICL/UTK  — 
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Processing  vs.  Memory  Access 
High Performance LINPACK


Consider  another  benchmark:  Linpack 

Ax  =  b 


Solve  this  linear  equation  for  the  vector  x,  where  A  is  a 
known  matrix,  and  b  is  a  known  vector.  Linpack  uses  the 
BLAS  routines,  which  divide  A  into  blocks. 


On average, Linpack requires 1 memory reference for every
2 FLOPS, or 4 Bytes/FLOP.

Many  of  these  can  be  cache  references 
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Processing  vs.  Memory  Access 
STREAM  TRIAD 


Consider  the  simple  benchmark:  STREAM  TRIAD 


a(i)  =  b(i)  +  q  *  c(i) 


a(i), b(i), and c(i) are vectors; q is a scalar
Vector length is chosen to be much longer than cache size


Each execution includes
2 memory loads + 1 memory store

2 FLOPS

12 Bytes/FLOP (assuming 64-bit precision)


No  computer  has  enough  memory  bandwidth  to  reference 

12  Bytes  for  each  FLOP! 
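
The Bytes/FLOP figures for STREAM TRIAD and Linpack follow from counting memory references and floating point operations. The sketch below shows the TRIAD kernel and that arithmetic in Python, assuming 8-byte (64-bit) words, which is the word size under which the 12 and 4 Bytes/FLOP figures come out; the function names are hypothetical.

```python
def triad(b, c, q):
    """STREAM TRIAD kernel: a(i) = b(i) + q * c(i)."""
    return [bi + q * ci for bi, ci in zip(b, c)]


def bytes_per_flop(memory_refs, flops, word_bytes=8):
    """Bytes moved to or from memory per floating point operation."""
    return memory_refs * word_bytes / flops


# TRIAD: 2 loads + 1 store and 2 flops (multiply, add) per element.
triad_ratio = bytes_per_flop(memory_refs=3, flops=2)

# Linpack: on average 1 memory reference for every 2 flops.
linpack_ratio = bytes_per_flop(memory_refs=1, flops=2)

if __name__ == "__main__":
    print(triad([1.0, 2.0], [3.0, 4.0], 2.0))
    print(f"TRIAD:   {triad_ratio:.0f} Bytes/FLOP")
    print(f"Linpack: {linpack_ratio:.0f} Bytes/FLOP")
```

TRIAD demands three times the memory traffic per flop that Linpack does, and none of it can be served from cache when the vectors are much longer than the cache.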
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Processing  vs.  Memory  Access 
RandomAccess 


The expected value of the number

[Figure: RandomAccess. Data-driven memory access into a table of 64-bit words.]

Acceptable Error: 1%

Look ahead and Storage: 1024 per "node"
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Bounding  Mission  Partner 
Applications


Outline 
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HPCS  HPCchallenge  Benchmarks 


*  HPCchallenge Benchmarks

-  Being  developed  by  Jack  Dongarra  (ICL/UT) 

-  Funded  by  the  DARPA  High  Productivity 
Computing  Systems  (HPCS)  program 
(Bob  Graybill  (DARPA/IPTO)) 


To  examine  the  performance  of  High  Performance 
Computer  (HPC)  architectures  using  kernels  with 
more  challenging  memory  access  patterns  than 
High  Performance  Linpack  (HPL) 
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HPCchallenge  Goals 


*  To  examine  the  performance  of  HPC 
architectures  using  kernels  with  more 
challenging  memory  access  patterns  than  HPL 

-  HPL works well on all architectures, even cache-based,
distributed memory multiprocessors, due to

1.  Extensive memory reuse

2.  Scalable with respect to the amount of computation

3.  Scalable with respect to the communication volume

4.  Extensive optimization of the software

*  To complement the Top500 list

*  To  provide  benchmarks  that  bound  the 
performance  of  many  real  applications  as  a 
function  of  memory  access  characteristics  — 
e.g.,  spatial  and  temporal  locality 
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HPCchallenge  Benchmarks 


Local 

•  DGEMM  (matrix  x  matrix  multiply) 

•  STREAM 

-  COPY 

-  SCALE 

-  ADD 

-  TRIAD

•  EP-RandomAccess

•  1D FFT

Global

•  High Performance LINPACK (HPL)

•  PTRANS: parallel matrix
transpose

•  G-RandomAccess

•  1D FFT

•  b_eff  —  interprocessor  bandwidth 
and  latency 



*  HPCchallenge  pushes  spatial  and  temporal  boundaries;  sets  performance  bounds 

*  Available  for  download  http://icl.cs.utk.edu/hpcc/ 
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Web  Site 

http://icl.cs.utk.edu/hpcc/ 


•  Home 

•  Rules 

•  News 

•  Download 

•  FAQ 

•  Links 

•  Collaborators 

•  Sponsors 

•  Upload 

•  Results 
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HPC  Challenge 


[Screenshot of the HPC Challenge Benchmark web site, with navigation links (Home, FAQ, Links, Collaborators, Sponsors) and the following benchmark descriptions.]

1.  HPL measures the floating point rate of execution for solving a linear system of equations.

2.  DGEMM measures the floating point rate of execution of double precision real matrix-matrix multiplication.

3.  STREAM is a simple synthetic benchmark program that measures sustainable memory bandwidth (in GB/s) and the corresponding computation rate for a simple vector kernel.

4.  PTRANS (parallel matrix transpose) exercises the communications where pairs of processors communicate with each other simultaneously. It is a useful test of the total communications capacity of the network.

5.  RandomAccess measures the rate of integer random updates of memory (GUPS).

6.  FFTE measures the floating point rate of execution of double precision complex one-dimensional Discrete Fourier Transform (DFT).

7.  b_eff (effective bandwidth benchmark) is a set of tests to measure latency and bandwidth of a number of simultaneous communication patterns.


Latest HPCC News

Linux clusters give HPC price-performance

... Terry, chief technology officer of Burnaby, British Columbia-based Cray Canada. In this interview, Terry explains that assertion and describes Cray's new Linux-based XD1 system, which will be priced competitively with other types of high-end Linux clusters.

Fast but going nowhere
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Preliminary  Results 
Machine  List  (1  of  2) 


Affiliation    Manufacturer                     System              ProcessorType    Procs
U Tenn         Atipa Cluster AMD 128 procs      Conquest cluster    AMD Opteron      128
AHPCRC         Cray X1 124 procs                X1                  Cray X1 MSP      124
AHPCRC         Cray X1 124 procs                X1                  Cray X1 MSP      124
AHPCRC         Cray X1 124 procs                X1                  Cray X1 MSP      124
ERDC           Cray X1 60 procs                 X1                  Cray X1 MSP      60
ERDC           Cray X1 60 procs                 X1                  Cray X1 MSP      60
ORNL           Cray X1 252 procs                X1                  Cray X1 MSP      252
ORNL           Cray X1 252 procs                X1                  Cray X1 MSP      252
AHPCRC         Cray X1 120 procs                X1                  Cray X1 MSP      120
ORNL           Cray X1 64 procs                 X1                  Cray X1 MSP      64
AHPCRC         Cray T3E 1024 procs              T3E                 Alpha 21164      1024
ORNL           HP zx6000 Itanium 2 128 procs    Integrity zx6000    Intel Itanium 2  128
PSC            HP AlphaServer SC45 128 procs    AlphaServer SC45    Alpha 21264B     128
ERDC           HP AlphaServer SC45 484 procs    AlphaServer SC45    Alpha 21264B     484
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Preliminary  Results 
Machine  List  (2  of  2) 


Affiliation    Manufacturer                     System                         ProcessorType       Procs
IBM            IBM 655 Power4+ 64 procs         eServer pSeries 655            IBM Power 4+        64
IBM            IBM 655 Power4+ 128 procs        eServer pSeries 655            IBM Power 4+        128
IBM            IBM 655 Power4+ 256 procs        eServer pSeries 655            IBM Power 4+        256
NAVO           IBM p690 Power4 504 procs        p690                           IBM Power 4         504
ARL            IBM SP Power3 512 procs          RS/6000 SP                     IBM Power 3         512
ORNL           IBM p690 Power4 256 procs        p690                           IBM Power 4         256
ORNL           IBM p690 Power4 64 procs         p690                           IBM Power 4         64
ARL            Linux Networx Xeon 256 procs     Powell                         Intel Xeon          256
U Manchester   SGI Altix Itanium 2 32 procs     Altix 3700                     Intel Itanium 2     32
ORNL           SGI Altix Itanium 2 128 procs    Altix                          Intel Itanium 2     128
U Tenn         SGI Altix Itanium 2 32 procs     Altix                          Intel Itanium 2     32
U Tenn         SGI Altix Itanium 2 32 procs     Altix                          Intel Itanium 2     32
U Tenn         SGI Altix Itanium 2 32 procs     Altix                          Intel Itanium 2     32
U Tenn         SGI Altix Itanium 2 32 procs     Altix                          Intel Itanium 2     32
NASA ASC       SGI Origin 23900 R16K 256 procs  Origin 3900                    SGI MIPS R16000     256
U Aachen/RWTH  SunFire 15K 128 procs            Sun Fire 15k/6800 SMP-Cluster  Sun UltraSparc III  128
OSC            Voltaire Cluster Xeon 128 procs  Pinnacle 2X200 Cluster         Intel Xeon          128
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STREAM  TRIAD  vs  HPL 
120-128  Processors 


Basic  Performance 
120-128  Processors 
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STREAM  TRIAD  vs  HPL 
>=252 Processors
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Basic  Performance 
>=252  Processors 


[Figure: bar chart of EP-STREAM TRIAD (Tflop/s) and HPL (TFlop/s) for each system. STREAM TRIAD: a(i) = b(i) + q * c(i). HPL: Ax = b.]


STREAM  ADD  vs  PTRANS 
60-128  Processors 
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STREAM  ADD  vs  PTRANS 
>=252 Processors


Basic  Performance 
>=252  Processors 
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Outline 


*  Brief  DARPA  HPCS  Overview 

*  Architecture/Application  Characterization 

*  HPCchallenge  Benchmarks 

*  Preliminary  Results 

*  Summary 
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Summary 


*  DARPA  HPCS  Subsystem  Performance  Indicators 

-  2+  PF/s  LINPACK 

-  6.5  PB/sec  data  STREAM  bandwidth 

-  3.2  PB/sec  bisection  bandwidth 

-  64,000  GUPS 

*  Important  to  understand  architecture/application  characterization 

-  Where did all the lost "Moore's Law" performance go?

*  HPCchallenge  Benchmarks  —  http://icl.cs.utk.edu/hpcc/ 

-  Peruse  the  results! 
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